Abstract

This paper proposes a human activity recognition method
based on features learned from 3D video data without
incorporating domain knowledge. In experiments on data
collected by RGBD cameras, the method outperforms the
techniques it is compared against. Our feature encoding
follows the bag-of-visual-words model, and an SVM classifier
then recognises the activities. We do not use skeleton or
tracking information, and the same technique is applied to
color and depth data.


1. Introduction 

Human activity recognition leverages various sensing
technologies and enables a wide range of potential applications.
Researchers in computer vision have made substantial progress
in activity analysis (Aggarwal & Ryoo, 2011). However, the
vision-based approach suffers from issues related to
obtrusiveness and the complexity of real-world settings.
Low-cost RGBD cameras, such as Microsoft Kinect devices,
provide both color and depth information, which can improve
the accuracy and robustness of activity recognition
(Aggarwal & Xia, 2014).

In this paper, we propose a method to extract features from
RGBD videos. The key component of our approach is an
Independent Subspace Analysis (ISA) network (Le et al., 2011).
Our approach does not rely on hand-crafted features, but learns
discriminative features from the input data. The same method
is used to extract features from all modalities: it operates
directly on color (grayscale) and depth data. Moreover, no
domain knowledge is incorporated in the method, so the
technique can be applied to various applications. Chen et al.
(2014) also utilized ISA networks, but they worked only on
depth data and relied on skeleton data to extract features.
Operating directly on color and depth data is essential because
skeleton information is not always available, due to sensor
noise or self-occlusion of human bodies, especially in
collaborative activities involving two or more persons.

The rest of this paper is organized as follows. In Section 2,
we present the ISA-based feature learning method. Section 3
evaluates our approach through experiments on interaction
activities between two persons. Finally, we conclude the paper
and outline possible extensions in Section 4.

2. Feature Learning using Independent 
Subspace Analysis 

Independent Subspace Analysis (ISA), an extension of
Independent Component Analysis (ICA), is widely used in the
field of natural image statistics (Hyvärinen et al., 2009).
ISA can extract features that are invariant to local
translation and selective to frequency, rotation and velocity
(Le et al., 2011). Le et al. (2011) adapted the technique to
recognize activities in videos, and their method achieved
state-of-the-art results at that time on well-known benchmark
datasets. Multiple ISA layers can be stacked to form an ISA
network, where the outputs of one layer are the inputs of the
layer above. The ISA network can thus learn a hierarchical
representation of the input data. A bag-of-words model can
then be applied to form feature vectors for a classification
algorithm.
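
The response of a single ISA layer can be sketched as follows. This is a minimal NumPy illustration of the usual ISA formulation (linear filters, squaring, pooling within each subspace, square root), not the authors' implementation. The function names, the sizes, and the random filters W are assumptions made for the example; in practice W is learned from data.

```python
import numpy as np

def isa_layer(X, W, V, eps=1e-8):
    """Responses of one ISA layer.

    X: (n, T) array, one flattened input patch per column.
    W: (k, n) first-stage linear filters (random here, learned in practice).
    V: (m, k) fixed 0/1 matrix assigning each filter to a subspace.
    Returns an (m, T) array of pooled responses.
    """
    Z = W @ X                            # linear filter outputs, shape (k, T)
    return np.sqrt(V @ (Z ** 2) + eps)   # square, pool per subspace, square root

def subspace_matrix(k, group_size):
    """Pooling matrix V that groups consecutive filters into subspaces."""
    assert k % group_size == 0
    m = k // group_size
    V = np.zeros((m, k))
    for i in range(m):
        V[i, i * group_size:(i + 1) * group_size] = 1.0
    return V

rng = np.random.default_rng(0)
n, k, group_size, T = 64, 8, 2, 5        # hypothetical sizes
X = rng.standard_normal((n, T))
W = rng.standard_normal((k, n))
V = subspace_matrix(k, group_size)
P = isa_layer(X, W, V)
print(P.shape)  # (4, 5)
```

Because the filter outputs are squared before pooling, the responses do not change when the sign of the input is flipped, which is one simple instance of the invariance that pooling provides.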

Using the method proposed by Le et al. (2011), we extract
features from unlabeled data of each single modality, e.g.
grayscale and depth video. In this paper, we process grayscale
and depth data in the same way (Figure 1). We first take a
spatio-temporal data block from each modality and flatten it
frame by frame into a vector. This vector is the input of the
ISA network. To train the ISA network, we use batch projected
gradient descent, the same technique as in (Le et al., 2011).
Finally, we use the pre-trained network and the k-means
algorithm to generate the histogram-based feature vectors for
an SVM-based classifier.
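
The flattening and training steps can be sketched as below. This is a hedged NumPy illustration under the standard ISA formulation, which minimises the summed layer responses subject to W W^T = I and restores the constraint after each gradient step by symmetric orthonormalisation. The block size, learning rate, iteration count, and function names are hypothetical, and the usual whitening of the input data is omitted for brevity.

```python
import numpy as np

def flatten_block(block):
    """Flatten a (frames, height, width) block frame by frame into a vector."""
    return block.reshape(-1)   # C order keeps each frame's pixels contiguous

def orthonormalize(W):
    """Project W onto the set W W^T = I via (W W^T)^(-1/2) W."""
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T @ W

def isa_objective_and_grad(W, V, X, eps=1e-8):
    """Sum of ISA responses over all samples and its gradient w.r.t. W."""
    Z = W @ X                              # (k, T) filter outputs
    P = np.sqrt(V @ (Z ** 2) + eps)        # (m, T) pooled responses
    grad = ((V.T @ (1.0 / P)) * Z) @ X.T   # chain rule through sqrt and square
    return P.sum(), grad

def projected_gradient_step(W, V, X, lr=1e-3):
    """One batch gradient step followed by projection onto the constraint."""
    _, grad = isa_objective_and_grad(W, V, X)
    return orthonormalize(W - lr * grad)

rng = np.random.default_rng(0)
frames, height, width = 2, 4, 4            # hypothetical block size
blocks = rng.standard_normal((50, frames, height, width))
X = np.stack([flatten_block(b) for b in blocks], axis=1)         # (32, 50)
k, group_size = 8, 2
V = np.kron(np.eye(k // group_size), np.ones((1, group_size)))   # (4, 8)
W = orthonormalize(rng.standard_normal((k, X.shape[0])))
for _ in range(20):
    W = projected_gradient_step(W, V, X)
```

After every step the rows of W are orthonormal again, so the constraint is never violated between iterations.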
Figure 1. Our proposed framework, which learns discriminative
features from color and depth data


3. Experimental Results 

We performed experiments on the second subset, OA2, of the
Office Activity dataset (Wang et al., 2014), which contains
ten interaction activities: Asking-and-away, Called-away,
Carrying, Chatting, Delivering, Eating-and-chatting,
Having-guest, Showing, Seeking-help, and Shaking-hands. Ten
subjects performed these activities in two different offices.
The authors divided the dataset according to the name of one
person in each interaction session.

Our experiments were run on a desktop computer with an Intel
Core i7 CPU and 16 GB of RAM. We first pre-train the ISA
network with unlabeled data. Then, we extract features using
the network and form the bag-of-visual-words representation
for each video clip. Finally, an SVM-based classifier is used
to recognize the activities. We used the RBF kernel and
performed a grid search to select the parameters. The OA2
videos (Wang et al., 2014) were processed at an image size of
80 x 60. The features extracted by our two-layer ISA network
are grouped into 100 clusters, from which we build the
bag-of-words model. Thus, each video is represented by a
100-dimensional vector. We then follow the same
cross-validation scheme as described in (Wang et al., 2014),
i.e. leave-one-person-out.
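
The encoding and validation steps described above can be sketched as follows. This is an illustrative NumPy example only: the random centroids stand in for the 100 k-means cluster centres, the L1 normalisation of the histogram and the person identifiers are assumptions, and the function names are hypothetical.

```python
import numpy as np

def bow_histogram(features, centroids):
    """Encode one clip as a normalised histogram of visual words.

    features:  (num_features, d) local features extracted from the clip.
    centroids: (K, d) cluster centres (obtained with k-means in practice).
    """
    # Squared Euclidean distance from every feature to every centroid.
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                     # nearest centroid per feature
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / hist.sum()                      # L1 normalisation (assumed)

def leave_one_person_out(person_ids):
    """Yield (held-out person, train indices, test indices) for each person."""
    person_ids = np.asarray(person_ids)
    for person in np.unique(person_ids):
        test = np.flatnonzero(person_ids == person)
        train = np.flatnonzero(person_ids != person)
        yield person, train, test

rng = np.random.default_rng(0)
centroids = rng.standard_normal((100, 16))        # stand-in for k-means output
clip_features = rng.standard_normal((500, 16))    # stand-in for ISA features
hist = bow_histogram(clip_features, centroids)    # 100-dimensional clip vector

person_ids = ["p1", "p1", "p2", "p3", "p3", "p3"]  # hypothetical labels
folds = list(leave_one_person_out(person_ids))
```

Each fold trains on all clips except those of one person and tests on that person's clips, so no subject appears in both sets.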

Table 1 and Table 2 show the average accuracy under the
leave-one-person-out validation scheme on the OA2 dataset,
compared with the state-of-the-art results reported by Wang
et al. (2014). Most confusion arises between semantically
similar activities (Figure 2). For instance, Ask is very
similar to Call in real life. To address this issue, we may
integrate new modalities into the system, e.g. a microphone
for audio recording.


Table 1. Average accuracy on the OA2 dataset (Wang et al., 2014)

                     (Wang et al., 2014)    Ours
Grayscale            41.6%                  60.0%
Depth                43.6%                  50.8%
Grayscale & Depth    45.0%                  61.3%


Table 2. Comparison of the accuracy per activity category on the
OA2 dataset with the results in (Wang et al., 2014). DCSF,
ConvNet, and ConfNet are the methods of (Xia & Aggarwal, 2013),
(Ji et al., 2013), and (Wang et al., 2014), respectively.

              DCSF     ConvNet   ConfNet   Ours
Ask           12.5%    39.6%     25.3%     44.7%
Call          45.8%    44.8%     57.5%     60.5%
Carry         66.7%    56.8%     53.5%     73.7%
Chat          37.5%    17.2%     25.3%     36.8%
Deliver       20.1%    34.5%     32.8%     50.0%
Eat&Chat      50.0%    35.8%     69.5%     86.8%
HaveGuest     37.5%    34.1%     43.7%     86.8%
SeekHelp      16.7%    44.8%     59.2%     68.4%
ShakeHands    41.7%    32.8%     59.8%     60.5%
Show          37.5%    29.3%     23.0%     44.7%
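
As a consistency check on Tables 1 and 2, the per-activity accuracies can be averaged and compared with the overall figures. The short sketch below assumes that the Ours and ConfNet columns of Table 2 correspond to the Grayscale & Depth row of Table 1, and that every activity carries equal weight in the average; neither assumption is stated explicitly in the text.

```python
# Per-activity accuracies (%) copied from Table 2, in row order.
ours = [44.7, 60.5, 73.7, 36.8, 50.0, 86.8, 86.8, 68.4, 60.5, 44.7]
confnet = [25.3, 57.5, 53.5, 25.3, 32.8, 69.5, 43.7, 59.2, 59.8, 23.0]

def mean(values):
    """Unweighted mean, i.e. every activity counts equally."""
    return sum(values) / len(values)

ours_mean = round(mean(ours), 1)
confnet_mean = round(mean(confnet), 1)
print(ours_mean)     # 61.3
print(confnet_mean)  # 45.0
```

Both means agree with the Grayscale & Depth row of Table 1, which supports the reading that Table 2 reports the combined-modality setting.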
Figure 2. Confusion matrix on the OA2 dataset (Wang et al., 2014)


4. Conclusion 

In this paper, we introduce an unsupervised learning method
that extracts features from color and depth data. The proposed
technique is used to recognize human activities in RGBD
videos. The features are extracted from raw color and depth
data, without relying on skeleton or tracking information. Our
experimental results show that using learned features improves
classification accuracy. Moreover, the approach is generic
enough to be applied in other applications. In future work,
more sensing modalities (e.g. a microphone) can be integrated
to improve the accuracy of our proposed approach.